
    GLAWI, a free XML-encoded Machine-Readable Dictionary built from the French Wiktionary

    This article introduces GLAWI, a large XML-encoded machine-readable dictionary automatically extracted from Wiktionnaire, the French edition of Wiktionary. GLAWI contains 1,341,410 articles and is released under a free license. Besides the size of its headword list, GLAWI inherits from Wiktionnaire its original macrostructure and the richness of its lexicographic descriptions: articles contain etymologies, definitions, usage examples, inflectional paradigms, lexical relations and phonemic transcriptions. The paper first gives some insights into the nature and content of Wiktionnaire, with a particular focus on its encoding format, before presenting our approach: the standardization of its microstructure and its conversion into XML. First intended to meet NLP needs, GLAWI has been used to create a number of customized lexicons dedicated to specific uses, including linguistic description and psycholinguistics. The main one is GLÀFF, a large inflectional and phonological lexicon of French. We show that many more specific, on-demand lexicons can easily be derived from the large body of lexical knowledge encoded in GLAWI.
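As a rough illustration of how such an XML-encoded article could be consumed programmatically, the sketch below parses a toy entry with Python's standard library. The element names (article, title, pos, pronunciation, definition) are illustrative assumptions, not GLAWI's actual schema:

```python
import xml.etree.ElementTree as ET

# A toy article in the spirit of an XML machine-readable dictionary entry.
# The element names below are assumptions for illustration only.
SAMPLE = """
<article>
  <title>chanteur</title>
  <pos>noun</pos>
  <pronunciation>\u0283\u0251\u0303.t\u0153\u0281</pronunciation>
  <definition>Personne qui chante.</definition>
</article>
"""

def read_article(xml_text):
    """Extract headword, part of speech, pronunciation and definition."""
    root = ET.fromstring(xml_text)
    return {
        "headword": root.findtext("title"),
        "pos": root.findtext("pos"),
        "pronunciation": root.findtext("pronunciation"),
        "definition": root.findtext("definition"),
    }

entry = read_article(SAMPLE)
```

A standardized microstructure like this is what makes it easy to derive customized lexicons: a filter over a stream of such articles is a few lines of code.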

    Wiktionnaire's Wikicode GLAWIfied: a Workable French Machine-Readable Dictionary

    GLAWI is a free, large-scale and versatile Machine-Readable Dictionary (MRD) that has been extracted from the French language edition of Wiktionary, called Wiktionnaire. In (Sajous and Hathout, 2015), we introduced GLAWI, gave the rationale behind the creation of this lexicographic resource and described the extraction process, focusing on the conversion and standardization of the heterogeneous data provided by this collaborative dictionary. In the current article, we describe the content of GLAWI and illustrate how it is structured. We also suggest various applications, ranging from linguistic studies and NLP applications to psycholinguistic experimentation; all of them can take advantage of the diversity of the lexical knowledge available in GLAWI. Besides this diversity and extensive lexical coverage, GLAWI is also remarkable because it is the only free lexical resource of contemporary French that contains definitions. This unique material opens the way to the renewal of MRD-based methods, notably the automated extraction and acquisition of semantic relations.

    Ne jetons pas le Wiktionnaire avec l'oripeau du Web ! Études et réalisations fondées sur le dictionnaire collaboratif

    Wiktionnaire is the French edition of Wiktionary, the free multilingual dictionary available online. A satellite of Wikipédia, of which it is the "lexical companion", the dictionary project remains in the encyclopedia's shadow. Like Wikipedia, it is founded on the wiki principle: it can be fed and modified by any Internet user, with immediate publication. While the encyclopedic resource has been used extensively in some disciplines, the collaborative dictionary seems to have received less attention from the scientific community. This lesser interest may stem from unfamiliarity with the resource, or from an a priori rejection of the amateurism readily associated with contributions made by non-specialists. In this article, we present some characteristics of Wiktionnaire, together with resources and applications built from it. This work aims to illustrate the possibilities offered by this singular dictionary and to help decide whether its exploitation is worthwhile, and for which uses. More precisely, we question the legitimacy of crowdsourced resources, and we examine to what extent Wiktionnaire can, through its specific features, both complement existing dictionary resources for linguistic studies and serve as a starting point for building an electronic lexicon for fields such as natural language processing and psycholinguistics. Our contribution to the characterization of Wiktionnaire comes with the release of two lexicons built from the collaborative dictionary. The first is a morphophonological lexicon with very large coverage, intended notably for NLP applications; we give examples of its possible uses in corpus-based linguistics. The second is a lexicon oriented towards psycholinguistics: derived from the first, it contains fewer entries, but includes for each of them a set of information usually used in that discipline. Both lexicons can be downloaded and queried online.

    Enrichissement de lexiques sémantiques approvisionnés par les foules : le systÚme WISIGOTH appliqué à Wiktionary

    Semantic lexical resources are a mainstay of various NLP applications. However, comprehensive and reliable resources rarely exist or are often not freely available. We discuss in this paper the context of lexical resource building and the problems of evaluation. We present Wiktionary, a freely available and collaboratively built multilingual dictionary, and we propose a semi-automatic approach based on random walks for enriching its synonymy network, using both endogenous and exogenous data. We then propose a validation "by crowds". Finally, we present an implementation of this system, called WISIGOTH.
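The random-walk idea can be sketched as a walk with restart over a small synonymy graph: words reachable in a few hops from a source word, but not yet linked to it, become candidate synonyms. The graph, restart probability and scoring below are toy assumptions, not the actual WISIGOTH pipeline:

```python
# Minimal random-walk-with-restart sketch over a toy synonymy graph.
# The graph and parameters are illustrative assumptions.

def walk_scores(graph, start, restart=0.15, steps=50):
    """Visit probabilities of a walker that restarts at `start`."""
    nodes = sorted(graph)
    prob = {n: 0.0 for n in nodes}
    prob[start] = 1.0
    for _ in range(steps):
        nxt = {n: 0.0 for n in nodes}
        for n, p in prob.items():
            neigh = graph[n]
            for m in neigh:
                nxt[m] += (1 - restart) * p / len(neigh)
        nxt[start] += restart  # teleport back to the source word
        prob = nxt
    return prob

synonyms = {
    "car": ["auto", "vehicle"],
    "auto": ["car", "automobile"],
    "automobile": ["auto", "vehicle"],
    "vehicle": ["car", "automobile"],
}
scores = walk_scores(synonyms, "car")
```

Here "automobile", two hops from "car", receives a nonzero score and could be proposed as a new synonym edge, to be validated "by crowds" as described above.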

    Looking for French deverbal nouns in an evolving Web (a short history of WAC)

    This paper describes an 8-year-long research effort to automatically collect new French deverbal nouns on the Web. The goal has remained the same: building an extensive and cumulative list of noun-verb pairs in which the noun denotes the action expressed by the verb (e.g. production - produce). This list is used both for linguistic research and for NLP applications. The initial method consisted of taking advantage of the former Altavista search engine, which allowed direct access to unknown word forms. The second technique led us to develop a specific crawler, which raised a number of technical difficulties. In the third experiment, we used a collection of web pages made available to us by a commercial search engine. Through all these stages, the general method has remained the same, and the results are similar and cumulative, although the technical environment has greatly evolved.

    From GLÀFF to PsychoGLÀFF: a large psycholinguistics-oriented French lexical resource

    In this paper, we present two French lexical resources, GLÀFF and PsychoGLÀFF. The former, automatically extracted from the collaborative online dictionary Wiktionary, is a large-scale, versatile lexicon exploitable in Natural Language Processing applications and linguistic studies. The latter, based on GLÀFF, is a lexicon specifically designed for psycholinguistic research. GLÀFF, counting more than 1.4 million entries, is of unprecedented size. It reports lemmas, main syntactic categories, inflectional features and phonemic transcriptions. PsychoGLÀFF contains additional information related to formal aspects of the lexicon and its distribution. It contains about 340,000 entries (120,000 lemmas) that are attested in corpora. We explain how the resources have been created and compare them to other known resources in terms of coverage and quality. Regarding PsychoGLÀFF, the comparison shows that it has an exceptionally large repertoire while being of comparable quality.

    Acquisition and enrichment of morphological and morphosemantic knowledge from the French Wiktionary

    We present two approaches to automatically acquiring morphologically related words from Wiktionary. Starting with related words explicitly mentioned in the dictionary, we propose a method based on orthographic similarity to detect new derived words in the entries' definitions, with an overall accuracy of 93.5%. Using word pairs from the initial lexicon as patterns of formal analogies to filter new derived words enables us to raise the accuracy to 99%, while extending the lexicon's size by 56%. In a final experiment, we show that it is possible to semantically type the morphological definitions, focusing on the detection of process nominals.
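The analogy-based filtering step can be sketched as follows: a known derivational pair such as (laver, lavage) serves as a pattern, and a candidate pair is kept only if it exhibits the same formal alternation. Real formal analogies are more general; this suffix-swap version is a simplified, illustrative approximation:

```python
# Simplified formal-analogy filter: accept a candidate word pair only if
# it shows the same prefix/suffix alternation as a known derivational pair.
# This is an illustrative approximation, not the paper's actual model.

def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def alternation(a, b):
    """Return the suffix pair turning a into b, e.g. ('er', 'age')."""
    k = common_prefix_len(a, b)
    return a[k:], b[k:]

def same_pattern(known_pair, candidate_pair):
    return alternation(*known_pair) == alternation(*candidate_pair)

ok = same_pattern(("laver", "lavage"), ("coder", "codage"))
bad = same_pattern(("laver", "lavage"), ("coder", "codeur"))
```

Pairs that merely look orthographically similar but follow a different alternation (here codeur, an agent noun rather than an action noun) are filtered out, which is how the analogy step raises precision over similarity alone.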

    GLÀFF, un Gros Lexique À tout Faire du Français

    This paper introduces GLÀFF, a large-scale, versatile French lexicon extracted from Wiktionary, the collaborative online dictionary. GLÀFF contains, for each entry, a morphosyntactic description and a phonetic transcription. It distinguishes itself from the other available lexicons mainly by its size, its potential for constant updating, and its copylefted license that makes it available for use, modification and redistribution. We explain how we have built GLÀFF and compare it to other known resources. We show that its size and quality are strong assets that could allow GLÀFF to become a reference lexicon for NLP, linguistics and psycholinguistics.

    Évaluation sur mesure de modÚles distributionnels sur un corpus spécialisé : comparaison des approches par contextes syntaxiques et par fenĂȘtres graphiques

    Distributional semantics models can be built using a simple bag-of-words representation of a word's contexts (window-based) or using more complex syntactic information (syntax-based). Previous studies have compared their relative efficiency without coming to a definitive conclusion, but such a comparison has never been performed on small, specialised corpora. We ran a set of comparative experiments based on a collection of French NLP articles and a custom-made gold standard. These experiments show a better overall performance of syntax-based models, as long as the syntactic information is processed with appropriate care.
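The window-based alternative described above can be sketched in a few lines: each word is represented by the bag of words occurring within a fixed window around it, and word similarity is the cosine between these count vectors. The corpus, window size and weighting below are toy assumptions, not the paper's experimental setup:

```python
from collections import Counter
from math import sqrt

# Minimal window-based distributional model: context = the words within
# +/-2 tokens; similarity = cosine between raw co-occurrence counts.
# Toy illustration only; real models use larger corpora and weighting.

def window_vectors(tokens, window=2):
    vectors = {}
    for i, w in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors.setdefault(w, Counter()).update(ctx)
    return vectors

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = ("the parser reads the sentence . "
          "the tagger reads the sentence .").split()
vecs = window_vectors(corpus)
sim = cosine(vecs["parser"], vecs["tagger"])
```

A syntax-based model would replace the raw window contexts with typed dependency contexts (e.g. subject-of-reads), which is where the "appropriate care" in processing syntactic information comes in.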

    Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce.

    We describe here the technical details of our participation in PAN 2012's "traditional" authorship attribution tasks. The main originality of our approach lies in the use of a large quantity of varied features to represent textual data, processed by a maximum entropy machine learning tool. Most of these features make intensive use of natural language processing annotation techniques, as well as generic language resources such as lexicons and other linguistic databases. Some of the features were even designed specifically for the target data type (contemporary fiction). Our belief is that richer features, which integrate external knowledge about language, have an advantage over knowledge-poorer ones (such as word and character n-gram frequencies) when training data is scarce (both in raw volume and in the number of training items per target author). Although overall results were average (66% accuracy over the main tasks for the best run), we focus in this paper on the differences between feature sets. While the "rich" linguistic features proved better than character trigrams and word frequencies, the most efficient features vary widely from task to task. For the intrusive-paragraphs tasks, we obtained better results (73% and 93%) while still using the maximum entropy engine as an unsupervised clustering tool.
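The character trigram features mentioned above, the knowledge-poor baseline against which the rich linguistic features are compared, can be sketched as a sliding window of three characters counted and normalized into relative frequencies. This is an illustrative baseline extractor, not the PAN system itself:

```python
from collections import Counter

# Character trigram profile: a knowledge-poor authorship feature set.
# Each text is mapped to the relative frequency of its 3-character windows.

def char_trigrams(text):
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

profile = char_trigrams("the cat sat on the mat")
```

Richer features would add, on top of such profiles, dimensions derived from NLP annotation (parts of speech, syntax) and from external lexicons, which is the contrast the paper investigates under scarce training data.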